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Abstract 

We derive ensembles of decision trees through 
a nonparametric Bayesian model, allowing us to 
view random forests as samples from a posterior 
distribution. This insight provides large gains in 
interpretability, and motivates a class of Bayesian 
forest (BF) algorithms that yield small but reli¬ 
able performance gains. Based on the BF frame¬ 
work, we are able to show that high-level tree hi¬ 
erarchy is stable in large samples. This leads to 
an empirical Bayesian forest (EBF) algorithm for 
building approximate BFs on massive distributed 
datasets and we show that EBFs outperform sub¬ 
sampling based alternatives by a large margin. 


1. Introduction 

Decision trees are a fundamental machine learning tool. 
They partition the feature (input) space into regions of re¬ 
sponse homogeneity, such that the response (output) value 
associated with any point in a given partition can be pre¬ 
dicted from the average for that of its neighbors. The classi¬ 
fication and regression tree (CART) algorithm of (Breiman 
et al., 1984) is a common recipe for building trees; it grows 
greedily through a series of partitions on features, each of 
which maximizes reduction in some measure of impurity at 
the current tree leaves (terminal nodes; i.e., the implied in¬ 
put space partitioning). The development of random forests 
(RF) by (Breiman, 2001), which predict through the aver¬ 
age of many CART trees fit to bootstrap data resamples, 
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is an archetype for the successful strategy of tree ensemble 
learning. For prediction problems with training sets that are 
large relative to the number of inputs, properly trained en¬ 
sembles of trees can predict out-of-the-box as well as any 
carefully tuned, application-specific alternative. 

This article makes three contributions to understanding and 
application of decision tree ensembles (or, forests). 

Bayesian forest: A nonparametric Bayesian (npB) point- 
of-view allows interpretation of forests as a sample from a 
posterior over trees. Imagine CART applied to a data gen¬ 
erating process (DGP) with finite support: the tree greed¬ 
ily splits support to minimize impurity of the partitioned 
response distributions (terminating at some minimum- 
leaf-probability threshold). We present a nonparametric 
Bayesian model for DGPs based on multinomial draws 
from (large) finite support, and derive the Bayesian forest 
(BF) algorithm for sampling from the distribution of CART 
trees implied by the posterior over DGPs. Random forests 
are an approximation to this BF exact posterior sampler, 
and we show in examples that BFs provide a small but re¬ 
liable gain in predictive performance over RFs. 


Posterior tree variability: Based upon this npB framework, 
we derive results on the stability of CART over different 
DGP realizations. We find that, conditional on the data al¬ 
located to a given node on the sample CART tree, the prob¬ 
ability that the next split for a posterior DGP realization 
matches the observed full-sample CART split is 


p (split matches sample CART) > 1 



(1) 


where p is the number of possible split locations and n the 
number of observations on the node. Even if p grows with 
n, the result indicates that partitioning can be stable con¬ 
ditional on the data being split. This conditioning is key: 
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CART’s well known instability is due to its recursive na¬ 
ture, such that a single split different from sample CART 
at some node removes any expectation of similarity below 
that node. However, for large samples, (1) implies that we 
will see little variation at the top hierarchy of trees in a for¬ 
est. We illustrate such stability in our examples. 

Empirical Bayesian forests: the npB forest interpreta¬ 
tion and tree-stability results lead us to propose empirical 
Bayesian forests (EBF) as an algorithm for building ap¬ 
proximate BFs on massive distributed datasets.Traditional 
empirical Bayesian analysis fixes parameters in high levels 
of a hierarchical model at their marginal posterior mode, 
and quantifies uncertainty for the rest of the model con¬ 
ditional upon these fixed estimates. EBFs work the same 
way: we fit a single shallow CART trunk to the sam¬ 
ple data, and then sample a BF ensemble of branches at 
each terminal node of this trunk. The initial CART trunk 
thus maps observations to their branch, and each branch 
BF is fit in parallel without any communication with the 
other branches. With little posterior variability about the 
trunk structure, an EBF sample should look similar to the 
(much more costly, or even infeasible) full BF sample. In a 
number of experiments, we compare EBFs to the common 
distributed-computing strategy of fitting forests to data sub¬ 
samples and find that the EBFs lead to a large improvement 
in predictive performance. This type of strategy is key to 
efficient machine learning with Big Data: focus the ‘Big’ 
on the pieces of models that are most difficult to learn. 

Bayesian forests are introduced in Section 2 along with a 
survey of Bayesian tree models. Section 3 investigates tree 
stability in theory and practice, and Section 4 presents the 
empirical Bayesian forest framework. Throughout, we use 
publicly available data on home prices in California to il¬ 
lustrate our ideas. We also provide a variety of other data 
analyses to benchmark performance, and close with de¬ 
scription of how EBF algorithms are being built and per¬ 
form in large-scale machine learning at eBay.com. 

2. Bayesian forests 

Informally, write dgp to represent the stochastic process de¬ 
fined over a set of possible DGPs. A Bayesian analogue to 
classical ‘distribution-free’ nonparametric statistical analy¬ 
sis (e.g., Hollander & Wolfe, 1999) has two components: 

1. set a nonparametric statistic T(dgp) that is of interest 
in your application regardless of the true DGP, 

2. and build a flexible model for the DGP, so that the 
posterior distribution on T(dgp) can be derived from 
posterior distribution on possible DGPs. 

In the context of this article, T (dgp) refers to a CART tree. 
Indeed, trees are useful precisely because they are good 


predictors regardless of the underlying data distribution - 
they do not rely upon distributional assumptions to share 
information across training observations. Our DGP model, 
detailed below, leads to a posterior for dgp that is repre¬ 
sented through random weighting of observed support. A 
Bayesian forest contains CART fits corresponding to each 
draw of support weights, and the BF ensemble prediction 
is an approximate posterior mean. 

2.1. Nonparametric model for the DGP 

We employ a Dirichlet-multinomial sampling model in 
nonparametric Bayesian analysis. The approach dates back 
to Ferguson (1973). Chamberlain & Imbens (2003) provide 
an overview in the context of econometric problems. Rubin 
(1981) proposed the Bayesian bootstrap as an algorithm for 
sampling from versions of the implied posterior, and it has 
since become closely associated with this model. 

Use z, = {x,. y,} to denote the features and response for 
observation i. We suppose that data are drawn indepen¬ 
dently from a finite L possible values, 

L 

dgp = p(z) = ^2 <*>/![»=<i] ( 2 ) 

;=l 

where uji > CM and Y^i w i = 1- Thus the generat¬ 
ing process for observation i draws l, from a multinomial 
with probability u;/,, and this indexes one of the L support 
points. Since L can be arbitrarily large, this so-far implies 
no restrictive assumptions beyond that of independence. 

The conjugate prior for cu is a Dirichlet distribution, written 
Dir(u;; u) oc nti"? 1 - We will parametrize the prior 
with a single concentration parameter u = a > 0, such that 
E[wj] = a/La = 1/L and var(cu;) = (L—l)/[L 2 (La+l)\. 
Suppose you have the observed sample Z = [z 1 - ■ ■ z.,,]'. 
For convenience, we allow Ci = Ck for / A; in the case 
of repeated values. Write li ... l n = 1... n so that z, = <^ i 
and Z = [Ci ’ ’ ’ C n] / ■ Then the posterior distribution for ui 
has uji = a + 1 for i < n and u>i = a for l > n, so that 

n L 

i»m * ik n w “ _i - (3) 

i =1 Z=n+1 

This, in turn, defines our posterior for the data generating 
process through our sampling model in (2). 

There are many possible strategies for specification of a 
and Q for l > n. 1 The non-informative prior that arises 
as a —> 0 is a convenient default: in this limit, uji = 0 
with probability one for l > n. 2 We apply this limiting 

*The unobserved Q act as data we imagine we might have 
seen, to smooth the posterior away from the data we have actually 
observed. See Poirier (2011) for discussion. 

2 For l > n the posterior has E[ui(] = 0 with variance 
var(uti) = lim a _>o a[n+a(L — l)]/[(n+La) 2 (n+La-\-l)\ = 0. 
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sample CART tree Bayesian tree draw 




Bayesian forest 



Figure 1. Motorcycle data illustration. The lines show each of CART fit to the data sample, CART fit to a sample with weights drawn 
IID from an exponential (point size proportional to weight), and the BF predictor which averages over 100 weighted CART fits. 


prior throughout, such that our posterior for the data gen¬ 
erating process is a multinomial draw from the observed 
data points, with a uniform Dir(l) distribution on the lu = 
[oji ... u n ]' sampling probabilities. We will also find it con¬ 
venient to parametrize un-normalized uj via IID exponen¬ 
tial random variables: 9 =[9\.. . 9 n ], where Exp(l) 
in the posterior and u>i = 9i/\9\ with \9\ = JT Q i . 

2.2. CART as posterior functional 

Conditional upon 9 , the population tree T ( dgp ) is defined 
through a weighted-sample CART fit. In particular, given 
data Z v = {X 17 , y 17 } in node 77 , sort through all dimensions 
of all observations in Z - ' 7 to find the split that minimizes 
the average of some w-weighted impurity metric across the 
two new child nodes. For example, in the case of regression 
trees, the impurity to minimize is weighted-squared error 

I ^y v ) = di ^ i ~ v v ) 2 ( 4 ) 

with p n = 9ii/i/\9 v \. For our classification trees, we 

minimize Gini impurity. The split candidates are restricted 
to satisfy a minimum leaf probability, such that every node 
in the tree must have \9''\ greater than some threshold . 3 
This procedure is repeated on every currently-terminal tree 
node until it is no longer possible to split while satisfying 
the minimum probability threshold. To simplify notation, 
we refer to the resulting CART tree as T(9). 

2.3. Posterior sampling 

Following (Rubin, 1981) we can sample from the posterior 
on T {9) via the Bayesian bootstrap in Algorithm 1. 

We’ve implemented BF through simple adjustment of the 
ensemble module of python’s scikit-learn (Pe- 

3 In practice this can be replaced with a threshold on the mini¬ 
mum number of observations at each leaf. 


Algorithm 1 Bayesian Forest 
for b = 1 to B do 

draw 9 b ~ Exp(l) 

run weighted-sample CART to get % = T(9 h ] 

end for 


dregosa et al., 2011 ). 4 As a quick illustration, Figure 1 
shows three fits for the conditional mean velocity of a mo- 
torcyle helmet after impact in a crash: sample CART T(l), 
a single draw of T{9), and the BF average prediction (data 
are from the MASS R package, Venables & Ripley, 2002). 

Note that the Bayesian forest differs from Breiman’s ran¬ 
dom forest only in that the weights are drawn from an expo¬ 
nential (or Dirichlet, when normalized) distribution rather 
than a Poisson (or multinomial) distribution. To the ex¬ 
tent that RF sampling provides a coarse approximation to 
the BF samples, the former is a convenient approximation. 
Moreover, we will find little difference in predictive perfor¬ 
mance between BFs and RFs, so that one should feel free 
to use readily available RF software while still relying on 
the ideas of this paper for intuition and interpretation. 

2.4. Bayesian tree-as-parameter models 

Other Bayesian analyses of DTs treat the tree as a parame¬ 
ter which governs the DGP, rather than a functional thereof, 
and thus place some set of restrictions on the distributional 
relationship between inputs and outputs. 

The original Bayesian tree model is the Bayesian CART 
(BCART) of Chipman et al. (1998). BCART defines a like¬ 
lihood where response values for observations allocated 
(via x) to the same leaf node are IID from some para¬ 
metric family (e.g., for regression trees, observations in 

4 Replace variable sample_counts in forest.py to be 
drawn exponential rather than multinomial when bootstrapping. 













Bayesian and Empirical Bayesian Forests 


each leaf are IID Gaussian with shared variance and mean). 
The BCART authors also propose linear regression leaves, 
while the Treed Gaussian Processes of Gramacy & Lee 
(2008) use Gaussian process regression at each leaf. The 
models are fit via Markov chain Monte Carlo (above ex¬ 
amples) or sequential Monte Carlo (Taddy et al., 2011) al¬ 
gorithms that draw from the posterior by proposing small 
changes to the tree (e.g., grow or prune ). 3 * 5 

Monte Carlo tree sampling use the same incremental moves 
that employed in CART. Unfortunately, this means that 
they tend to get stuck in locally-optimal regions of tree- 
space. Bayesian Additive Regression Trees (BART; Chip- 
man et al., 2010 ) replace a single tree with the sum of many 
small trees. An input vector is allocated to a leaf in each 
tree, and the corresponding response distribution has mean 
equal to the average of each leaf node value and variance 
shared across observations. Original BART has Gaussian 
errors, and extensions include Gaussian mixtures. Since 
BART only samples short trees, it is fast and mixes well. 

Easy sampling comes at a potentially steep price: the as¬ 
sumption of homoskedastic additive errors . 6 Despite this 
restriction, empirical studies have repeatedly shown BART 
to outperform alternative flexible prediction rules. Many 
response variables (especially after, say, log transforma¬ 
tion) have the property that they are well fit by flexible re¬ 
gression with homoskedastic errors. Whenever the model 
assumptions in BART are close enough to true, it should 
outperform methods which do not make those assumptions. 

In contrast, the npB interpretation of forests (BF or RF) 
makes it clear that they are suited to applications where 
the response distribution defies parametric representation, 
such that CART fit is the most useful DGP summary avail¬ 
able. We often encounter this situation in application. For 
example, internet transaction data combines discrete point 
masses with an incredibly fat right tail (e.g., see Taddy 
et al., 2014). In academia it is common to transform such 
data before analysis, but businesses wish to target the re¬ 
sponse on the scale measured (e.g., clicks or dollars) and 
need to build a predictor that does well on that scale. 

2.5. Friedman example 

A common simulation experiment in evaluating flexible 
predictors is based around the Friedman (1991) function, 

Z/ = /( x )+£ (5) 

= 10sin(7ra;ia;2) + 20 ( 0:3 — 0.5 ) 2 + 10x4 + 5x5 + £• 

where e ~ N(0,1) and Xj ~ U(0,1). 

3 The Bayesian bootstrap is also a potential sampling tool in 

this tree-as-parameter setting. See Clyde & Lee (2001) for details 

on the technique and its relation to model averaging. 

6 For classification, this is manifest through a probit link. 



Figure 2. Friedman experiment predictive RMSE over 100 runs. 


For our experiment, we follow previous authors by includ¬ 
ing as features for training the spurious Xq ... x p . Each re¬ 
gression models are fit to 100 random draws from (5) and 
tested at 1000 new x locations. Root mean square error 
(RMSE) is calculated between predicted and tme /(x). 

Results over 100 repeats are shown in Figure 2 . 7 As fore¬ 
cast, the only model which assumes the tme homoskedas¬ 
tic error structure, BART, well outperforms all others. The 
two forests, BF and RF, are both a large improvement over 
DT. The BF average RMSE is only about 1% better than 
the RF’s, since Bayesian and classical bootstrapping dif¬ 
fer little in practice. BCART does very poorly: worse than 
a single DT. We hypothesis that this is due to the notori¬ 
ously poor mixing of the BCART MCMC, such that this fit 
is neither finding a posterior mean (as it is intended to) or 
optimizing to a local posterior mode (as DT does). 

We also include the extremely randomized trees (ET) of 
Geurts et al. (2006), which are similar to RFs except that (a) 
instead of optimizing greedy splits, a few candidate splits 
are chosen randomly and the best is used and (b) the full 
unweighted data sample is used to fit each tree. ET slightly 
outperforms both BF and RF; in our experience this hap¬ 
pens in small datasets where the restriction of population 
support to observed support (as assumed in our npB analy¬ 
sis) is invalid and the forest posteriors are over-fit. 

2.6. California housing data 

As more realistic example, we consider the California 
housing data of Pace & Barry (1997) consisting of median 
home price along with eight features (location, income, 
housing stock) for 20640 census blocks. Since prices tend 

7 In this and the next example, CART-based algorithms had 
minimum-leaf-samples set at 3 and the ensembles contain 100 
trees. BART and BCART run at their bayestree and tgp R 
package defaults, except that BART draws only 200 trees after a 
burn-in of 100 MCMC iterations. This is done to get comparable 
compute times; for these settings BART requires around 75 sec¬ 
onds per fold with the California housing data, compared to BF’s 
10 seconds when run in serial, and 3 seconds running on 4 cores. 
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Figure 3. California housing data. The response sample is on the left, and the right panel shows predictive RMSE across 10-fold CV. 



BF 

RF 

EBF 

ET 

SSF 

BART 

DT 

RMSE 

48.2 

48.5 

49.4 

52.5 

53.1 

54.8 

65.6 

%WTB 

0.0 

0.5 

2.4 

8.7 

10.0 

13.4 

35.9 


Table 1. Average RMSE in $10k and % worse-than-best for the 
California housing data 10-fold CV experiment. 

to vary with covariates on a multiplicative scale, standard 
analyses take log median price as the response of interest. 
Instead, we will model the conditional expectation of dol¬ 
lar median price. This is relevant to applications where 
prediction of raw transaction data is of primary interest. 
The marginal response distribution for median home price 
is shown in the left panel of Figure 3. 8 

Figure 3 shows results from a 10-fold cross-validation (CV) 
experiment, with details in Table 1. Except for DT and 
BCART, which still perform worse than all others 9 , re¬ 
sults are reversed from those for the small-sample and ho- 
moskedastic Friedman data. Both RF and BF do much bet¬ 
ter than BART and it’s restrictive error model. BF again 
offers a small gain over RF. Since the larger sample size 
makes observed support a better approximation to popula¬ 
tion support, the forests outperform ET. The EBF (empiri¬ 
cal Bayesian forest) and SSF (sub-sample forest) predictors 
are based on distributed computing algorithms that we in¬ 
troduce later in Section 3. At this point we note only that 
the massively scalable EBF is amongst the top performers; 
the next section helps explain why. 

3. Understanding the posterior over trees 

Theory on decision trees is sparse. The original CART 
book of Breiman et al. (1984) provides consistency results; 
they show that any partitioner that is able to learn enough 

8 Values appear to have been capped at $500k 

9 We’ve left off BCART’s average RMSE of $82k, 70% WTB. 


to split the DGP support into very low probability leaves 
will be able to reproduce the conditional response distribu¬ 
tion of that DGR However, this result says nothing about 
the structure of the underlying trees, nor does it say any¬ 
thing about the ability of a tree to predict when there is not 
enough data to finely partition the DGR Others have fo¬ 
cused on the frequentist properties of individual split deci¬ 
sions. In the CHAID work of Kass (1980), splits are based 
upon x 2 tests at each leaf node. Loh (2002) and Hothorn 
et al. (2006) are generalizations, both of which combat 
multiple testing and other biases inherent in tree-building 
through a sequence of hypothesis tests. However, such con¬ 
tributions provide little intuition in the setting where we are 
not working from a no-split null hypothesis distribution. 

Despite this lack of theory, it is generally thought that there 
is large uncertainty (sampling or posterior) about the struc¬ 
ture of decision trees. For example, Geurts & Wehenkel 
(2000) present extensive simulation of tree uncertainty and 
find that the location and order of splits is often no-better 
than random (this motivates work by the same authors on 
extremely randomized trees). The intuition behind such 
randomness is clear: the probability of a tree having any 
given branch structure is the product of conditional proba¬ 
bilities for each successive split. After enough steps, any 
specific tree approaches probability of zero. 

However, it is possible that elements of the tree structure 
are stable. For example, in the context of boosting, Appel 
et al. (2013) argue that the conditionally optimal split loca¬ 
tions for internal nodes can be learned from subsets of the 
full data allocated to each node. They use this to propose a 
faster boosting algorithm. In this section we make a related 
claim: in large samples, there is little posterior variation for 
the top of the tree. We make the point first in theory, then 
through empirical demonstration. 
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3.1. Probability of the sample CART tree 

We focus on regression trees for this derivation, wherein 
node impurity is measured as the sum of squared errors. 
Consider a simplified setup with each Xj £ {0,1} a binary 
random variable (possibly created through discretization of 
a continuous input). We’ll investigate here the probability 
that the impurity minimizing split on a given node is the 
same for a given realization of posterior DGP weights as it 
is for the unweighted data sample. 

Suppose {zi ... z n } are the data to be partitioned at some 
tree node, with Zj = [y i: Xu ,..., Xi p \'. Say that f j = {i : 
Xij = 0 } and t j = {i : Xij = 1 } are the partitions implied 
by splitting on a given Xj. The resulting impurity is 

= \'52 e i[yi~ Vj( x ij)] 2 > ( 6 ) 

i 

= Eie fj Vi @i/ I’ Mj(l) = Ei£ tj 2/* Oil \ I - 

We could use the Bayesian bootstrap to simulate the pos¬ 
terior for (jj implied by the exponential posterior on 9, but 
an analytic expression is not available. Instead, we follow 
the approach used in Lancaster (2003), Poirier (2011), and 
Taddy et al. (2014): derive a first-order Taylor approxima¬ 
tion to the function tjj and describe the posterior for that 
related functional. 


In particular, the lxn gradient of <jj with respect to 0 is 

Vcrf = V- 

J n 


T-o-y 2 . - 


(7) 


which has elements Vi<rJ = {yi — p(xij)) /n 
The Taylor approximation is then 

~ ^ = °j(i) + Va X=i( 0 - 1) 


( 8 ) 


with yj( 0 ) = Eier, Vi and %(!) = ^7 E ietj W the 

r J J . . J J 

observed response averages in each partition. 


Suppose that ai(l) < cr ; (1) Vj, so that variable T’ is that 
selected for splitting based on the unweighted data sam¬ 
ple. Then we can quantify variability about this selection 
by looking at differences in approximate impurity. 


AjiO) = & 1 - a: 
1 
n 


(9) 


= - Yl 9i [ y i ( x ii ) - Vj C x a ) - 

2 y* {yi{xu) - Vjfaij))] 


l 

n 


Oidji- 


Say dj = [dji ... dj n ]' is the vector of squared error dif¬ 
ferences. Then the total difference has mean Hi A, = dj 
and variance varA, = d' d j/n 2 . Since A, is the mean 
of independent Exponential random variables with known 
means and variances, the central limit theorem applies and 
it converges in distribution to a Gaussian: 

y/n(Aj(0) - dj) N(0, djdj/ri). (10) 

The weighted-sample impurity-minimizing split matches 
that for the unweighted-sample if and only if all A j are 
negative, which occurs with probability (note dj < 0 ) 

P 

P(Aj < 0 : j = 2...p) > l-^p(A,- >0) (11) 

j =2 

v—^ I d j I 

1 - - ! 1 J| 

3=2 V \J d i d i/ n 

> l _1 exp(—nZj/2) 

z jVn 

where Zj = \dj | (d'd j/n) 2 is sample mean increase in 
impurity over the sample standard deviation of impurity. 
This ratio is bounded in probability, so that that the expo¬ 
nential bound goes to zero very quickly with n. In par¬ 
ticular, ignoring variation in Zj across variables, we arrive 
at the approximate lower bound on the probability of the 
sample split 

p (posterior split matches sample split) 

Even allowing for p ss n x d, with d some underlying 
continuous variable dimension and p the input dimension 
discretization on these variables, the probability of the ob¬ 
served split goes to one at order 0(n -1 ) if d < e n /n 3 / 2 . 

Given this, why is there any uncertainty at all about trees? 
The answer is recursion: each split is conditionally stable 
given the sample at the current node, but the probability 
of any specific sequence of splits is roughly (we don’t ac¬ 
tually have independence) the product of individual node 
split probabilities. This will get small as we move deeper 
down the tree, and given one split different from the sam¬ 
ple CART the rest of the tree will grow arbitrarily far from 
this modal structure. In addition, the sample size going into 
our probability bounds is shrinking exponentially with each 
partition, whereas the dimension of eligible split variables 
is reduced only by one at each level. 

Regardless of overall tree variability, we can take from this 
section an expectation that for large samples the high-level 
tree structure varies little across posterior DGP realizations. 
The next section shows this to be the case in practice. 
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Figure 4. California housing data. The left shows the CART fit on unweighted data with a minimum of 3500 samples-per-leaf. The right 
panel shows the distribution of first and second split locations - always on median income - across 100 draws from the BF posterior. 


3.2. Trunk stability in California 

We’ll illustrate the stability of high-level tree structure on 
our California housing data. Consider a ‘trunk’ DT fit to the 
unweighted data with no less than 3500 census blocks (out 
of 20640 total) in each leaf partition. This leads to the tree 
on the left of Figure 4 with five terminal nodes. It splits on 
the obvious variable of median income, and also manages 
to divide California into its north and south regions (34.455 
degrees north is just north of Santa Barbara). 

To investigate tree uncertainty, we can apply the BF algo¬ 
rithm and repeatedly fit similar CART trees to randomly- 
weighted data. Running a 100 tree BF, we find that the 
sample tree occurs 62% of the time. The second most com¬ 
mon tree, occurring 28% of the time, differs only in that it 
splits on median income again instead of on housing me¬ 
dian age. Thus 90% of the posterior weight is on trees that 
split on income twice, and then latitude. Moreover, a strik¬ 
ing 100% of trees have first two splits on median income. 
From this, we can produce the plot in the right panel of Fig¬ 
ure 4 showing the locations of these first two splits: each 
split-location posterior is tightly concentrated around the 
corresponding sample CART split. 

4. Empirical Bayesian forests 

Empirical Bayes (EB) is an established framework with a 
successful track record in fast approximate Bayesian infer¬ 
ence; see, e.g., Efron (2010) for an overview. In parametric 
models, EB can be interpreted as fixing at their marginal 
posterior maximum the parameters at higher levels of a 
hierarchical model. For example, in the simple setting of 
many group means (the average student test score for each 
of many schools) shrunk to a global center (the outcome 
for an imaginary ‘average school’), an EB procedure will 
shrink each group mean toward the overall sample average 
(each school towards all-student average). Kass & Steffey 


(1989) investigate such procedures and show that, under 
fairly general conditions, the EB conditional posterior for 
each group mean quickly approaches the fully Bayesian un¬ 
conditional posterior as the sample size grows. This occurs 
because there is little uncertainty about the global mean. 

CART trees are not a parametric model, but they are hierar¬ 
chical and admit an interpretation similar to those studied in 
Kass & Steffey (1989). The data which reaches any given 
interior node is a function of the partitioning implied by 
nodes shallower in the tree. Moreover, due to the greedy 
algorithm through which CART grows, a given shallow 
trunk is unaffected by changes to the tree structure below. 
Finally, Section 3 demonstrated that, like high levels in a 
parametric hierarchical model, there is relatively little un¬ 
certainty about high levels of the tree. 

An empirical Bayesian forest (EBF) takes advantage of this 
structure by fixing the highest levels in the hierarchy - the 
earliest CART splits - at a single estimate of high poste¬ 
rior probability. 10 We fit a single shallow CART trunk to 
the unweighted sample data, and sample a BF ensemble of 
branches at each terminal node of this trunk. The initial 
CART trunk maps observations to their branch, and each 
branch BF deals with a dataset of manageable size and can 
be fit in parallel without any communication with the other 
branches. That is, EBF replaces the BF posterior sample of 
trees with a conditional posterior sample that takes the pre¬ 
fit trunk as fixed. Since the trunk has relatively low vari¬ 
ance for large samples (precisely the setting where such 
distribution is desirable), the EBF should provide predic¬ 
tions similar to that of the full BF at a fraction of the cost. 

In contrast, consider the ‘sub-sample forest’ (SSF) al¬ 
gorithm, which replaces the full data with random sub¬ 
samples of manageable size. Independent forests are fit 

l0 These earliest splits - requiring impurity search over large 
observation sets - are also the most expensive to optimzie. 
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RMSE 

% WTB 


0.5905 

0.0 

BF 

0.5953 

0.8 

EBF 

0.6607 

11.9 

SSF 

0.7648 

29.5 

DT 



MCR 

% WTB 


0.4341 

0.0 

BF 

0.4531 

4.4 

EBF 

0.5989 

38.0 

SSF 

0.6979 

60.8 

DT 


Figure 5. Wine Data: 10-fold 00S prediction experiment Root Figure 6. Beer Data: 10-fold OOS prediction experiment Miss- 
Mean Square Error and % Worse than Best for the mean RMSE. Classification Rate and % Worse than Best for the average MCR. 


to each sub-sample and predictions are averaged across all 
forests. SSF is a commonly applied strategy (e.g., see men¬ 
tion, but not recommendation, of it in Panda et al., 2009), 
but it implies using only partial data for learning deep tree 
structure. Although the tree trunks are stable, the full tree 
is highly uncertain and learning such structure is precisely 
where you want to use a full Big Data sample. 

4.1. Out-of-sample experiments 

In the California housing experiment of Figure 3 and Table 
1, EBF 11 predictions are only 2% worse than those from 
the full BF. In contrast, SSF predictions are 10% worse. 

We consider two additional prediction problems. The first 
example, taken from (Cortez et al., 1998), involves predic¬ 
tion of an ‘expert’ quality ranking on the scale of 0-10 for 
wine based upon 11 continuous attributes (physiochemi- 
cal properties of the wine) plus wine color (red or white). 
There are 4898 observations. Results are in Figure 5: EBF 
is only 1% worse than the full BF, while SSF is 12% worse. 

The second example is from the Nielson Consumer Panel 
data, available for academic research through the Kilts 
Center at Chicago Booth, and our sample contains 73,128 
purchases of light beer in a number of US markets during 
2004. The response of interest is brand choice, of a possible 
five major light beer labels: Budweiser, Miller, Coors, Nat¬ 
ural, and Busch. Each purchase is associated with a house¬ 
hold that is codified through 16 standard demographic cat¬ 
egorical variables (maximum age, total income, main occu¬ 
pation, etc). Applying classification forests and DTs (based 
on Gini impurity) leads to the results in Figure 6: EBF is 
only 4.4% worse than the BF, while SSF is 38% worse. 

4.2. Choosing the trunk depth 

From the distributed computing perspective which moti¬ 
vates this work, you should fix the trunk only as deep as 
you must. In our full scale applications (next section), this 

11 EBFs use five node trunks in this Section. The SSFs are fit 
on data split into five equally sized subsets. 


means setting branch size so that the working memory re¬ 
quired to fit a forest at each branch is slightly less than the 
memory available on each machine. 

This leaves open questions, e.g., on the trade-off between 
number of trees in the forest and depth of the trunk. We 
don’t yet have rigorous answers, but the npB framework 
can help in further investigation. In the interim. Table 2 
shows the effect on OOS error from doubling and halving 
the minimum leaf size for the EBFs in our three examples. 



CA housing 

Wine 

Beer 

MLS 10 3 

6 3 1.5 

2 1 

0.5 

20 

10 

5 

% WTB 

1.6 2.4 4.3 

0.3 0.8 

2.2 

1.0 

4.4 

7.6 


Table 2. % Worse than Best on OOS error for different minimum 
leaf sizes (MLS) specified in thousands of observations. 

5. Scaling for Big Data 

To close, we note that this work is motivated by the need 
for reliable forest fitting algorithms that can be deployed 
on millions or hundreds of millions of observations, as en¬ 
countered when analyzing internet commerce data. A num¬ 
ber of additional engineering details are required for such 
deployment, but the basic approach is straightforward: an 
initial trunk is fit (possibly to a data subset) 12 , and this trunk 
acts as a sorting function to map observations to separate 
locations corresponding to each branch. Forests are then fit 
at each location (machine) for each branch. 

Preliminary work at eBay.com applies EBFs for prediction 
of ‘Bad Buyer Experiences’ (e.g. complaints, returns, or 
shipping problems) on the site. Training on a relatively 
small sample of 12 million transactions, the EBF algorithm 
using 32 branch chunks is able to provide a 1.3% drop in 
misclassification over the SSF alternatives. This amounts 
to more than 20,000 extra detected BBE occurrences over 
the short sample window, potentially giving eBay the op¬ 
portunity to try and stop them before they occur. 

l2 Note that the original CART trunk can itself be fit in distribu¬ 
tion, e.g. using the MLLib library for Apache Spark. 
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